KERT: Automatic Extraction and Ranking of Topical Keyphrases from Content-Representative Document Titles
We introduce KERT (Keyphrase Extraction and Ranking by Topic), a framework
for topical keyphrase generation and ranking. By shifting from the
unigram-centric traditional methods of unsupervised keyphrase extraction to a
phrase-centric approach, we are able to directly compare and rank phrases of
different lengths. We construct a topical keyphrase ranking function which
implements the four criteria that represent high quality topical keyphrases
(coverage, purity, phraseness, and completeness). The effectiveness of our
approach is demonstrated on two collections of content-representative titles in
the domains of Computer Science and Physics. Comment: 9 pages
Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations
The dialogue summarization task involves summarizing long conversations while
preserving the most salient information. Real-life dialogues often involve
naturally occurring variations (e.g., repetitions, hesitations) and existing
dialogue summarization models suffer a performance drop on such
conversations. In this study, we systematically investigate the impact of such
variations on state-of-the-art dialogue summarization models using publicly
available datasets. To simulate real-life variations, we introduce two types of
perturbations: utterance-level perturbations that modify individual utterances
with errors and language variations, and dialogue-level perturbations that add
non-informative exchanges (e.g., repetitions, greetings). We conduct our
analysis along three dimensions of robustness: consistency, saliency, and
faithfulness, which capture different aspects of the summarization model's
performance. We find that both fine-tuned and instruction-tuned models are
affected by input variations, with the latter being more susceptible,
particularly to dialogue-level perturbations. We also validate our findings via
human evaluation. Finally, we investigate if the robustness of fine-tuned
models can be improved by training them with a fraction of perturbed data and
observe that this approach is insufficient to address robustness challenges
with current models and thus warrants a more thorough investigation to identify
better solutions. Overall, our work highlights robustness challenges in
dialogue summarization and provides insights for future research.
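The two perturbation types described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `GREETINGS` list, and the word-repetition heuristic are assumptions chosen only to show the difference between modifying a single utterance and inserting a non-informative exchange into the dialogue.

```python
import random

# Hypothetical filler exchanges; the abstract only says "repetitions, greetings".
GREETINGS = [("Hi there!", "Hello!"), ("How are you?", "Good, thanks.")]

def perturb_utterance(text, rng, p=0.3):
    """Utterance-level: repeat each word with probability p (a crude repetition)."""
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < p:
            out.append(word)
    return " ".join(out)

def perturb_dialogue(turns, rng):
    """Dialogue-level: insert one non-informative greeting exchange at a random position."""
    first, second = rng.choice(GREETINGS)
    speakers = [speaker for speaker, _ in turns]
    s1 = speakers[0]
    # Use a second distinct speaker if one exists, otherwise reuse the first.
    s2 = next((s for s in speakers if s != s1), s1)
    pos = rng.randrange(len(turns) + 1)
    return turns[:pos] + [(s1, first), (s2, second)] + turns[pos:]

rng = random.Random(0)
dialogue = [("A", "can we move the meeting"), ("B", "sure friday works")]
noisy_turns = [(s, perturb_utterance(t, rng)) for s, t in dialogue]
longer_dialogue = perturb_dialogue(dialogue, rng)
```

A robustness check in the spirit of the abstract would summarize both the original and the perturbed dialogue and compare the two summaries; that comparison is left out here because it depends on the summarization model used.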
Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours
Text classification can be useful in many real-world scenarios, saving a lot
of time for end users. However, building a custom classifier typically requires
coding skills and ML knowledge, which poses a significant barrier for many
potential users. To lift this barrier, we introduce Label Sleuth, a free open
source system for labeling and creating text classifiers. This system is unique
for (a) being a no-code system, making NLP accessible to non-experts, (b)
guiding users through the entire labeling process until they obtain a custom
classifier, making the process efficient -- from cold start to classifier in a
few hours, and (c) being open for configuration and extension by developers. By
open sourcing Label Sleuth we hope to build a community of users and developers
that will broaden the utilization of NLP models. Comment: 7 pages, 2 figures
SCENE: Structural Conversation Evolution Network
It's not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, 'what do people talk about?' focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying 'how' people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static, document, and focuses on the structural elements of the conversation instead of being tied to the specific content.
Graph-based Classification on Heterogeneous Information Networks
A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for part of the objects. Learning from such labeled and unlabeled data via classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently.
In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only part of the objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetClass, is proposed to model the link structure in information networks with arbitrary network schema and number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods. Unpublished; peer reviewed.
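The idea of keeping each link type in its own relation graph while propagating labels can be sketched as follows. This is a generic label-propagation illustration under stated assumptions, not the framework proposed in the paper: the function `propagate`, the mixing weight `mu`, the simple neighbour averaging, and the toy author/paper/venue network are all hypothetical.

```python
from collections import defaultdict

def propagate(relations, seeds, classes, mu=0.8, iters=50):
    """relations: {link_type: [(u, v), ...]}, treated as undirected links.
    seeds: {node: class} for the labeled objects.
    Returns {node: predicted class} for every node seen."""
    # One adjacency map per link type, so relation graphs are never merged.
    adj = {r: defaultdict(set) for r in relations}
    nodes = set(seeds)
    for r, edges in relations.items():
        for u, v in edges:
            adj[r][u].add(v)
            adj[r][v].add(u)
            nodes.update((u, v))
    prior = {n: {c: 1.0 if seeds.get(n) == c else 0.0 for c in classes}
             for n in nodes}
    score = {n: dict(prior[n]) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Average neighbour scores inside each relation graph separately.
            per_rel = []
            for r in adj:
                nbrs = adj[r].get(n)
                if nbrs:
                    per_rel.append({c: sum(score[m][c] for m in nbrs) / len(nbrs)
                                    for c in classes})
            new[n] = {}
            for c in classes:
                # Combine the relation graphs, then pull towards the seed label.
                smooth = sum(p[c] for p in per_rel) / len(per_rel) if per_rel else 0.0
                new[n][c] = (1 - mu) * prior[n][c] + mu * smooth
        score = new
    return {n: max(classes, key=lambda c: score[n][c]) for n in nodes}

# Toy bibliographic network with two link types and two labeled papers.
relations = {
    "writes": [("a1", "p1"), ("a1", "p2"), ("a2", "p3"), ("a2", "p4")],
    "published_in": [("p1", "v1"), ("p2", "v1"), ("p3", "v2"), ("p4", "v2")],
}
labels = propagate(relations, {"p1": "DB", "p3": "ML"}, ["DB", "ML"])
```

In this toy network the labels assigned to the two seed papers spread to the authors, venues, and unlabeled papers connected to them, which mirrors the abstract's goal of predicting labels for all object types rather than for a single type.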
Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications
One of the major challenges of mining topics from a large corpus is the quality of the constructed topics. While phrase-generating approaches generally produce high quality output, they do not scale well with the size of the data. Thus, state-of-the-art solutions usually rely upon scalable unigram-generating methods, which do not produce high quality human-readable topics, or are forced to use external knowledge bases. Furthermore, while document collections naturally contain topics at different levels of granularity (general vs. specific), very few traditional methods focus on generating high quality hierarchical topic structures.
This dissertation presents a series of approaches that directly address these challenges of generating high quality phrase-based topics, both as a flat set and organized as a hierarchy, as well as some potential applications. First, we describe a framework that generates high-quality topics represented by integrated lists of mixed-length phrases. The key is adapting a phrase-centric view towards the construction and ranking of topical phrases. The approach is domain-independent, and requires neither expert supervision nor an external knowledge base. The framework is initially constructed to work on collections of short texts, such as titles of scientific documents. However, we then show how the framework can be easily and robustly extended to work on collections of longer texts, and demonstrate its applicability to human needs with a task-centric evaluation.
The dissertation then addresses the need to move beyond generating a flat set of topics, and presents an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high quality phrases at varying levels of granularity. Another application of this technique is then presented: the task of entity role discovery. By tying entities in a community to topical phrases, users are able to explicitly understand both how and why individual entities are ranked within a specific community. A final extension is then described: a combined approach for constructing the hierarchy that uses entity link information to improve the hierarchy quality.
SCENE: Structural Conversation Evolution NEtwork
It’s not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, ‘what do people talk about?’ focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying ‘how’ people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static, document, and focuses on the structural elements of the conversation instead of being tied to the specific content.